The project I am angling towards deals with my website's traffic and taxonomy data. I will be trying to build models that can accurately predict which tags are best for specific channels of traffic, and will also be investigating the longitudinal nature of an article's lifecycle (days to 90% of traffic is my current threshold for an article being done, but I will refine that with some descriptive statistics).

I'll also want to investigate what kind of content performs best in each month.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

First, import the table of tag-article mappings from our SQL db


In [2]:
df = pd.read_csv('atlas-taggings.csv')

In [3]:
df.head(10)


Out[3]:
tag_id tag_url tagged_type tagged_id tagged_url
0 36 www.atlasobscura.com/categories/abandoned Place 9982 www.atlasobscura.com/places/athens-olympic-spo...
1 2 www.atlasobscura.com/categories/panoramas Place 1676 www.atlasobscura.com/places/velaslavasay-panorama
2 2 www.atlasobscura.com/categories/panoramas Place 6431 www.atlasobscura.com/places/gettysburg-cyclorama
3 2 www.atlasobscura.com/categories/panoramas Article 2311 www.atlasobscura.com/articles/rip-gettysburg-c...
4 258 www.atlasobscura.com/categories/bridges Place 10134 www.atlasobscura.com/places/gimbel-s-bridge
5 2 www.atlasobscura.com/categories/panoramas Place 6430 www.atlasobscura.com/places/borodino-panorama
6 2 www.atlasobscura.com/categories/panoramas Place 6428 www.atlasobscura.com/places/panorama-mesdag
7 2 www.atlasobscura.com/categories/panoramas Place 3688 www.atlasobscura.com/places/panorama-raclawice
8 3 www.atlasobscura.com/categories/disasters Place 6343 www.atlasobscura.com/places/mars-bluff-crater
9 4 www.atlasobscura.com/categories/atom-bombs Place 6343 www.atlasobscura.com/places/mars-bluff-crater

In [4]:
articles = df[df.tagged_type == 'Article']

We only care about the articles for this analysis. Place entries are outside scope.


In [5]:
articles.head()


Out[5]:
tag_id tag_url tagged_type tagged_id tagged_url
3 2 www.atlasobscura.com/categories/panoramas Article 2311 www.atlasobscura.com/articles/rip-gettysburg-c...
56 27 www.atlasobscura.com/categories/objects-of-int... Article 2227 www.atlasobscura.com/articles/objects-of-intri...
57 27 www.atlasobscura.com/categories/objects-of-int... Article 2268 www.atlasobscura.com/articles/objects-of-intri...
58 27 www.atlasobscura.com/categories/objects-of-int... Article 2213 www.atlasobscura.com/articles/objects-of-intri...
62 27 www.atlasobscura.com/categories/objects-of-int... Article 2216 www.atlasobscura.com/articles/objects-of-intri...

Extract the tag name from the tag's URL


In [34]:
def get_tag(x):
    return x.split('/')[2]

#changing this function to get_tag_name() in module.


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-34-e82a2969e1fb> in <module>()
      2 def get_tag(x):
      3     return x.split('/')[2]
----> 4 tag_mapping.tag_url = tag_mapping.tag_url.apply(get_tag)
      5 #changing this function to get_tag_name() in module.

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2235             values = lib.map_infer(values, boxer)
   2236 
-> 2237         mapped = lib.map_infer(values, f, convert=convert_dtype)
   2238         if len(mapped) and isinstance(mapped[0], Series):
   2239             from pandas.core.frame import DataFrame

pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:63043)()

<ipython-input-34-e82a2969e1fb> in get_tag(x)
      1 tag_mapping.head()
      2 def get_tag(x):
----> 3     return x.split('/')[2]
      4 tag_mapping.tag_url = tag_mapping.tag_url.apply(get_tag)
      5 #changing this function to get_tag_name() in module.

IndexError: list index out of range

Create a tag_url column that just has the tag's name


In [10]:
articles.tag_url = articles.tag_url.apply(get_tag)
articles.head()


Out[10]:
tag_id tag_url tagged_type tagged_id tagged_url
3 2 panoramas Article 2311 www.atlasobscura.com/articles/rip-gettysburg-c...
56 27 objects-of-intrigue Article 2227 www.atlasobscura.com/articles/objects-of-intri...
57 27 objects-of-intrigue Article 2268 www.atlasobscura.com/articles/objects-of-intri...
58 27 objects-of-intrigue Article 2213 www.atlasobscura.com/articles/objects-of-intri...
62 27 objects-of-intrigue Article 2216 www.atlasobscura.com/articles/objects-of-intri...

Get dummies for each tag


In [11]:
test = pd.get_dummies(articles.tag_url)

In [12]:
test.head()


Out[12]:
100-wonders 19th-century 2016-election 30-rock 31-days-of-halloween abandoned abandoned-amusement-parks abandoned-brooklyn abandoned-cemetaries abandoned-hospitals ... world-s-smallest world-s-tallest world-war-ii wunderkammer wwi wwii yehlui-geological-park yeti zombies zoos
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
56 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
57 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
58 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
62 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 983 columns

Join the dummies back to the main dataframe


In [13]:
articles = articles.join(test)

In [14]:
articles.drop(['tag_id','tag_url','tagged_type','tagged_id'],axis=1,inplace=True)

In [15]:
articles.head()


Out[15]:
tagged_url 100-wonders 19th-century 2016-election 30-rock 31-days-of-halloween abandoned abandoned-amusement-parks abandoned-brooklyn abandoned-cemetaries ... world-s-smallest world-s-tallest world-war-ii wunderkammer wwi wwii yehlui-geological-park yeti zombies zoos
3 www.atlasobscura.com/articles/rip-gettysburg-c... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
56 www.atlasobscura.com/articles/objects-of-intri... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
57 www.atlasobscura.com/articles/objects-of-intri... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
58 www.atlasobscura.com/articles/objects-of-intri... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
62 www.atlasobscura.com/articles/objects-of-intri... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 984 columns

De-dupe articles but maintain the tagging data using groupby and sum


In [16]:
unique_articles = articles.groupby('tagged_url').sum() #made into func

In [17]:
unique_articles = unique_articles.reset_index()

In [18]:
unique_articles = unique_articles.set_index('tagged_url')

Using a csv generated by a script I wrote that queries Google Analytics for pageviews per article from publish date to n-days post-publication, import pageview data and join it to the tag/article DataFrame


In [19]:
#now we need the pageviews and have to map the URLs to Page Titles
pageviews = pd.read_csv('output_articles_performance.csv',header=None,names=['url','published','pageviews'])
pageviews.head()
#In the future I should import the module and run it here instead of grabbing.


Out[19]:
url published pageviews
0 jamaica-may-get-rid-of-queen-elizabeth-and-fin... 2016-04-15 3997
1 trippy-blacklight-posters-from-the-psychedelic... 2016-04-15 7042
2 leonardo-da-vincis-living-descendants-have-bee... 2016-04-15 12448
3 catapult-into-the-weekend-like-this-gopro-off-... 2016-04-15 4187
4 cat-rescued-after-4-days-stuck-on-insanely-tal... 2016-04-15 2721

In [20]:
pageviews.url = ['www.atlasobscura.com/articles/' + x for x in pageviews.url]

In [21]:
pageviews.head()


Out[21]:
url published pageviews
0 www.atlasobscura.com/articles/jamaica-may-get-... 2016-04-15 3997
1 www.atlasobscura.com/articles/trippy-blackligh... 2016-04-15 7042
2 www.atlasobscura.com/articles/leonardo-da-vinc... 2016-04-15 12448
3 www.atlasobscura.com/articles/catapult-into-th... 2016-04-15 4187
4 www.atlasobscura.com/articles/cat-rescued-afte... 2016-04-15 2721

In [22]:
pageviews.describe()


Out[22]:
pageviews
count 3446.000000
mean 7052.891759
std 23256.215270
min 1.000000
25% 1150.250000
50% 2571.500000
75% 5834.750000
max 621494.000000

Set the pageviews index to the url column to make joining easy


In [23]:
pageviews.set_index('url',inplace=True)

In [24]:
article_set = unique_articles.join(pageviews)

In [25]:
article_set.head()


Out[25]:
100-wonders 19th-century 2016-election 30-rock 31-days-of-halloween abandoned abandoned-amusement-parks abandoned-brooklyn abandoned-cemetaries abandoned-hospitals ... world-war-ii wunderkammer wwi wwii yehlui-geological-park yeti zombies zoos published pageviews
tagged_url
www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-08-01 651.0
www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-06-09 3505.0
www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-05-12 840.0
www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-12-30 4037.0
www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-01-07 1620.0

5 rows × 985 columns

Reset index


In [26]:
article_set.reset_index()


Out[26]:
tagged_url 100-wonders 19th-century 2016-election 30-rock 31-days-of-halloween abandoned abandoned-amusement-parks abandoned-brooklyn abandoned-cemetaries ... world-war-ii wunderkammer wwi wwii yehlui-geological-park yeti zombies zoos published pageviews
0 www.atlasobscura.com/articles/10-little-known-... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-08-01 651.0
1 www.atlasobscura.com/articles/10-of-the-greate... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-06-09 3505.0
2 www.atlasobscura.com/articles/10-places-12-yea... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-05-12 840.0
3 www.atlasobscura.com/articles/10-things-that-y... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-12-30 4037.0
4 www.atlasobscura.com/articles/100-wonders-a-vi... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-01-07 1620.0
5 www.atlasobscura.com/articles/100-wonders-an-i... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-08-20 4049.0
6 www.atlasobscura.com/articles/100-wonders-batt... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-12-17 2727.0
7 www.atlasobscura.com/articles/100-wonders-bloo... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-15 1290.0
8 www.atlasobscura.com/articles/100-wonders-clow... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-10-21 1450.0
9 www.atlasobscura.com/articles/100-wonders-dese... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-12-03 2635.0
10 www.atlasobscura.com/articles/100-wonders-devi... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-11-19 2886.0
11 www.atlasobscura.com/articles/100-wonders-edis... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-11-12 1600.0
12 www.atlasobscura.com/articles/100-wonders-its-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-07-28 792.0
13 www.atlasobscura.com/articles/100-wonders-last... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-10 3267.0
14 www.atlasobscura.com/articles/100-wonders-mode... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-07-30 6290.0
15 www.atlasobscura.com/articles/100-wonders-necr... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-10-15 8366.0
16 www.atlasobscura.com/articles/100-wonders-new-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-06-18 10149.0
17 www.atlasobscura.com/articles/100-wonders-the-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-23 1625.0
18 www.atlasobscura.com/articles/100-wonders-the-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-02-04 998.0
19 www.atlasobscura.com/articles/100-wonders-the-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-17 3102.0
20 www.atlasobscura.com/articles/100-wonders-the-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-10-26 2396.0
21 www.atlasobscura.com/articles/100-wonders-the-... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-03-24 2490.0
22 www.atlasobscura.com/articles/100-wonders-the-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-06-25 1294.0
23 www.atlasobscura.com/articles/100-wonders-the-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-07-09 2509.0
24 www.atlasobscura.com/articles/100-wonders-the-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-10-08 4845.0
25 www.atlasobscura.com/articles/100-wonders-the-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-22 2366.0
26 www.atlasobscura.com/articles/100-wonders-the-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-08-10 5815.0
27 www.atlasobscura.com/articles/100-wonders-the-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-07-14 1654.0
28 www.atlasobscura.com/articles/100-wonders-the-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-03-18 3708.0
29 www.atlasobscura.com/articles/100-wonders-the-... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-08-27 1841.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2783 www.atlasobscura.com/articles/williamsburg-sav... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-02-12 1101.0
2784 www.atlasobscura.com/articles/winters-effigies... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-12-17 16565.0
2785 www.atlasobscura.com/articles/wishing-trees 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-07-30 5333.0
2786 www.atlasobscura.com/articles/without-people-p... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-02-17 3408.0
2787 www.atlasobscura.com/articles/wolhusen-mortuar... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-01-19 2469.0
2788 www.atlasobscura.com/articles/wonderland-lost-... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-05-16 7380.0
2789 www.atlasobscura.com/articles/wonders-of-polar... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-01-28 1883.0
2790 www.atlasobscura.com/articles/woody-guthries-w... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-04-03 3541.0
2791 www.atlasobscura.com/articles/woolly-mammoth-o... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-07-16 1740.0
2792 www.atlasobscura.com/articles/working-at-a-coo... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-12-29 6031.0
2793 www.atlasobscura.com/articles/world-record-fil... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-03-02 3258.0
2794 www.atlasobscura.com/articles/world-s-largest-... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-12-04 93.0
2795 www.atlasobscura.com/articles/world-s-oldest-b... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-10-15 4814.0
2796 www.atlasobscura.com/articles/world-wingsuit-l... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-10-14 1135.0
2797 www.atlasobscura.com/articles/worlds-fair-reli... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-01-07 2233.0
2798 www.atlasobscura.com/articles/worldwide-scotch... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-03-06 23143.0
2799 www.atlasobscura.com/articles/wrapping-armchai... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-02-03 1409.0
2800 www.atlasobscura.com/articles/written-in-the-s... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-05-11 2892.0
2801 www.atlasobscura.com/articles/wwii-to-syria-ho... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-10-12 5503.0
2802 www.atlasobscura.com/articles/xylothek 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-05-22 8876.0
2803 www.atlasobscura.com/articles/yarn-stores-cand... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-30 11984.0
2804 www.atlasobscura.com/articles/you-can-now-take... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-07-15 3507.0
2805 www.atlasobscura.com/articles/you-still-have-t... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-03-14 13.0
2806 www.atlasobscura.com/articles/your-new-favorit... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-11-13 2256.0
2807 www.atlasobscura.com/articles/your-ticket-to-t... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-07-13 690.0
2808 www.atlasobscura.com/articles/youre-not-a-true... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-11-18 823.0
2809 www.atlasobscura.com/articles/youve-visited-10... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-11-16 1660.0
2810 www.atlasobscura.com/articles/zeroes-after-zer... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-01-07 11902.0
2811 www.atlasobscura.com/articles/zombie-mines-hau... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-05-16 1676.0
2812 www.atlasobscura.com/articles/zzyzx-california... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-24 43015.0

2813 rows × 986 columns


In [27]:
article_set['upper_quartile'] = [1 if x > 10000 else 0 for x in article_set.pageviews]

In [28]:
article_set.pageviews.plot(kind='hist', bins=100,title='Page View Distribution, All Content')


Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x1060f1510>

In [29]:
article_set['published'] = pd.to_datetime(article_set['published'])

In [30]:
article_set


Out[30]:
100-wonders 19th-century 2016-election 30-rock 31-days-of-halloween abandoned abandoned-amusement-parks abandoned-brooklyn abandoned-cemetaries abandoned-hospitals ... wunderkammer wwi wwii yehlui-geological-park yeti zombies zoos published pageviews upper_quartile
tagged_url
www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-08-01 651.0 0
www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-06-09 3505.0 0
www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-05-12 840.0 0
www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-12-30 4037.0 0
www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-01-07 1620.0 0
www.atlasobscura.com/articles/100-wonders-an-island-you-dont-want-to-visit 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-08-20 4049.0 0
www.atlasobscura.com/articles/100-wonders-battleship-island 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-12-17 2727.0 0
www.atlasobscura.com/articles/100-wonders-blood-falls 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-15 1290.0 0
www.atlasobscura.com/articles/100-wonders-clown-motel 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-10-21 1450.0 0
www.atlasobscura.com/articles/100-wonders-desertron 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-12-03 2635.0 0
www.atlasobscura.com/articles/100-wonders-devils-kettle 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-11-19 2886.0 0
www.atlasobscura.com/articles/100-wonders-edisons-last-breath 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-11-12 1600.0 0
www.atlasobscura.com/articles/100-wonders-its-taco-time 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-07-28 792.0 0
www.atlasobscura.com/articles/100-wonders-last-tree-of-tenere 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-10 3267.0 0
www.atlasobscura.com/articles/100-wonders-model-behavior 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-07-30 6290.0 0
www.atlasobscura.com/articles/100-wonders-necropants 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-10-15 8366.0 0
www.atlasobscura.com/articles/100-wonders-new-york-s-triangle-of-shame 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-06-18 10149.0 1
www.atlasobscura.com/articles/100-wonders-the-arrow-stork 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-23 1625.0 0
www.atlasobscura.com/articles/100-wonders-the-atomic-clock 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-02-04 998.0 0
www.atlasobscura.com/articles/100-wonders-the-blue-lagoon-of-buxton 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-17 3102.0 0
www.atlasobscura.com/articles/100-wonders-the-bone-church 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-10-26 2396.0 0
www.atlasobscura.com/articles/100-wonders-the-cave-of-crystals 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-03-24 2490.0 0
www.atlasobscura.com/articles/100-wonders-the-classiest-saint-relic-in-europe 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-06-25 1294.0 0
www.atlasobscura.com/articles/100-wonders-the-color-of-control 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-07-09 2509.0 0
www.atlasobscura.com/articles/100-wonders-the-dyatlov-incident 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-10-08 4845.0 0
www.atlasobscura.com/articles/100-wonders-the-everlasting-lightning-storm 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-22 2366.0 0
www.atlasobscura.com/articles/100-wonders-the-gates-of-hell 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-08-10 5815.0 0
www.atlasobscura.com/articles/100-wonders-the-glowing-ocean 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-07-14 1654.0 0
www.atlasobscura.com/articles/100-wonders-the-great-boston-molasses-flood 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-03-18 3708.0 0
www.atlasobscura.com/articles/100-wonders-the-great-green-wall-of-africa 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-08-27 1841.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
www.atlasobscura.com/articles/williamsburg-savings-bank-restoration 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-02-12 1101.0 0
www.atlasobscura.com/articles/winters-effigies-the-deviant-history-of-the-snowman 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-12-17 16565.0 1
www.atlasobscura.com/articles/wishing-trees 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-07-30 5333.0 0
www.atlasobscura.com/articles/without-people-project 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-02-17 3408.0 0
www.atlasobscura.com/articles/wolhusen-mortuary-chapel-where-real-skulls-join-a-dance-of-death 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-01-19 2469.0 0
www.atlasobscura.com/articles/wonderland-lost-the-abandoned-beijing-amusement-park-is-razed 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-05-16 7380.0 0
www.atlasobscura.com/articles/wonders-of-polar-architecture 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-01-28 1883.0 0
www.atlasobscura.com/articles/woody-guthries-wardy-forty 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-04-03 3541.0 0
www.atlasobscura.com/articles/woolly-mammoth-on-display-in-japan 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-07-16 1740.0 0
www.atlasobscura.com/articles/working-at-a-cookie-factory-ruined-cookies-for-me-forever 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-12-29 6031.0 0
www.atlasobscura.com/articles/world-record-filibuster-ends-after-192-hours-of-orwell-and-internet-comments 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-03-02 3258.0 0
www.atlasobscura.com/articles/world-s-largest-manta-ray-trafficker-bust-in-indonesia 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-12-04 93.0 0
www.atlasobscura.com/articles/world-s-oldest-botanical-gardens 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-10-15 4814.0 0
www.atlasobscura.com/articles/world-wingsuit-league-china-grand-prix-2015 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-10-14 1135.0 0
www.atlasobscura.com/articles/worlds-fair-relics-paris 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-01-07 2233.0 0
www.atlasobscura.com/articles/worldwide-scotch-shortage-compounds-existing-bourbon-scarcity 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-03-06 23143.0 1
www.atlasobscura.com/articles/wrapping-armchairs-in-wire-and-other-childhood-attempts-to-travel-in-time 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-02-03 1409.0 0
www.atlasobscura.com/articles/written-in-the-skin-3-places-to-find-books-bound-in-skin 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-05-11 2892.0 0
www.atlasobscura.com/articles/wwii-to-syria-how-seed-vaults-weather-wars 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-10-12 5503.0 0
www.atlasobscura.com/articles/xylothek 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-05-22 8876.0 0
www.atlasobscura.com/articles/yarn-stores-candy-shops-funeral-homes-and-more-of-the-uncategorizable-punny-businesses 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-30 11984.0 1
www.atlasobscura.com/articles/you-can-now-take-your-pot-to-the-skies-in-oregon 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-07-15 3507.0 0
www.atlasobscura.com/articles/you-still-have-time-to-apply-to-be-a-fulltime-ninja-in-japan 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2016-03-14 13.0 0
www.atlasobscura.com/articles/your-new-favorite-honey-is-made-out-of-bug-poop-and-bee-vomit 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-11-13 2256.0 0
www.atlasobscura.com/articles/your-ticket-to-the-1893-columbian-exposition 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-07-13 690.0 0
www.atlasobscura.com/articles/youre-not-a-true-australian-until-youve-been-divebombed-by-a-magpie 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-11-18 823.0 0
www.atlasobscura.com/articles/youve-visited-100-countries-join-the-club 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-11-16 1660.0 0
www.atlasobscura.com/articles/zeroes-after-zeroes-the-worlds-highest-currencies 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-01-07 11902.0 1
www.atlasobscura.com/articles/zombie-mines-haunt-the-landscape 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-05-16 1676.0 0
www.atlasobscura.com/articles/zzyzx-california-or-the-biggest-health-spa-scam-in-american-history 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-09-24 43015.0 1

2813 rows × 986 columns


In [31]:
article_set['year'] = pd.DatetimeIndex(article_set['published']).year

In [32]:
ax = article_set.boxplot(column='pageviews',by='year',figsize=(6,6),showfliers=False)
ax.set(title='PV distribution by year',ylabel='pageviews')


Out[32]:
[<matplotlib.text.Text at 0x1181edb50>, <matplotlib.text.Text at 0x1181f9150>]

Articles published more recently have, on average, received much more traffic than older articles (this reflects growth and heavier distribution of the newer content). The drop in the mean as we move into 2016 is an artifact of the article's lifecycle not being complete.

Article lifecycle will be explored below.


In [33]:
yearly = article_set.set_index('published').resample('M').mean().plot(y='pageviews')
yearly.set(title='Total Pageviews By Month of Article Publication')


Out[33]:
[<matplotlib.text.Text at 0x118f3e690>]

Let's import the time-series I created with a python script that asks GA for the daily time-series of Pageviewsof each article from publication date forward two years.


In [35]:
time_series = pd.read_csv('time-series.csv')

In [36]:
type(time_series)


Out[36]:
pandas.core.frame.DataFrame

In [37]:
time_series = time_series.drop('Unnamed: 0',axis=1)

It was easier to collect the data from GA by looping over the columns in my original dataframe, but having each row be an article record is easier to work with now, so we transpose.


In [38]:
time_series = time_series.T

In [39]:
time_series.columns


Out[39]:
RangeIndex(start=0, stop=731, step=1)

In [40]:
time_series['total'] = time_series.sum(axis=1)

In [41]:
time_series.head()


Out[41]:
0 1 2 3 4 5 6 7 8 9 ... 722 723 724 725 726 727 728 729 730 total
10-little-known-beaches-to-explore-in-the-last-days-of-summer 2.0 419.0 203.0 19.0 4.0 6.0 2.0 7.0 4.0 35.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 940.0
10-of-the-greatest-overland-migrations-photos 468.0 368.0 658.0 325.0 138.0 40.0 33.0 77.0 63.0 21.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 4826.0
10-places-12-year-old-me-would-love-to-live 106.0 762.0 271.0 132.0 209.0 96.0 41.0 15.0 9.0 9.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 4621.0
10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton 2186.0 538.0 209.0 377.0 92.0 134.0 80.0 34.0 18.0 75.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 4482.0
100-wonders-a-visit-with-a-frozen-dead-guy 928.0 272.0 231.0 87.0 96.0 40.0 16.0 11.0 7.0 7.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2032.0

5 rows × 732 columns

Let's determine how many days post-publication it takes for an article to collect 90% of total pageviews.


In [42]:
time_series['days_to_90p']= [(time_series.iloc[x].expanding().sum() > time_series.iloc[x].total*.90).argmax() \
                                 for x in range(len(time_series))]

In [43]:
time_series.reset_index(inplace=True)

In [44]:
time_series.head(1)


Out[44]:
index 0 1 2 3 4 5 6 7 8 ... 723 724 725 726 727 728 729 730 total days_to_90p
0 10-little-known-beaches-to-explore-in-the-last... 2.0 419.0 203.0 19.0 4.0 6.0 2.0 7.0 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 940.0 189

1 rows × 734 columns


In [45]:
time_series['index'] = ['www.atlasobscura.com/articles/' + x for x in time_series['index']]
time_series.set_index('index',inplace=True)
time_series = time_series.join(pageviews.published)
time_series.head(5)


Out[45]:
0 1 2 3 4 5 6 7 8 9 ... 724 725 726 727 728 729 730 total days_to_90p published
index
www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer 2.0 419.0 203.0 19.0 4.0 6.0 2.0 7.0 4.0 35.0 ... NaN NaN NaN NaN NaN NaN NaN 940.0 189 2015-08-01
www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos 468.0 368.0 658.0 325.0 138.0 40.0 33.0 77.0 63.0 21.0 ... NaN NaN NaN NaN NaN NaN NaN 4826.0 230 2015-06-09
www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live 106.0 762.0 271.0 132.0 209.0 96.0 41.0 15.0 9.0 9.0 ... NaN NaN NaN NaN NaN NaN NaN 4621.0 634 2014-05-12
www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton 2186.0 538.0 209.0 377.0 92.0 134.0 80.0 34.0 18.0 75.0 ... NaN NaN NaN NaN NaN NaN NaN 4482.0 19 2015-12-30
www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy 928.0 272.0 231.0 87.0 96.0 40.0 16.0 11.0 7.0 7.0 ... NaN NaN NaN NaN NaN NaN NaN 2032.0 45 2016-01-07

5 rows × 734 columns


In [46]:
time_series['published'] = pd.to_datetime(time_series.published)

In [47]:
time_series['year_pub'] = pd.DatetimeIndex(time_series['published']).year

In [48]:
time_series.boxplot(column='days_to_90p',by='year_pub')


Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d1c8250>

In [49]:
time_series.year_pub.value_counts(dropna=False)


Out[49]:
 2015.0    1346
 2016.0     775
 2014.0     476
 2013.0     447
 2012.0      30
NaN          11
 2010.0       3
 2011.0       2
Name: year_pub, dtype: int64

In [50]:
time_series[['days_to_90p','total','year_pub']].corr()


Out[50]:
days_to_90p total year_pub
days_to_90p 1.000000 -0.058821 -0.742601
total -0.058821 1.000000 0.092965
year_pub -0.742601 0.092965 1.000000

In [403]:
#I DON'T KNOW WHY THIS WON'T WORK
time_series['30-day-PVs'] = [time_series.fillna(value=0).iloc[x,0:31].sum() for x in range(len(time_series))]

In [417]:
time_series['7-day-PVs'] = [time_series.fillna(value=0).iloc[x,0:8].sum() for x in range(len(time_series))]

Now let's look at the number of articles per tag (we will later join the two DataFrames above into one)


In [92]:
total_tagged= pd.DataFrame(data=article_set.sum(),columns = ['num_tagged'])

In [93]:
total_tagged.sort_values('num_tagged',ascending=False,inplace=True)

In [94]:
total_tagged.drop('pageviews',axis=0,inplace=True)

In [95]:
total_tagged[total_tagged.num_tagged >= 10].count()


Out[95]:
num_tagged    199
dtype: int64

In [96]:
total_tagged[total_tagged.num_tagged <=5].index


Out[96]:
Index([u'india', u'funeral-art', u'banks', u'bioluminescence', u'bars',
       u'assassination', u'utopias', u'flora', u'turkey', u'bicycles',
       ...
       u'earthquakes', u'pink', u'pigs', u'physics', u'edmund-hillary',
       u'education', u'philip-k-dick', u'pharmacy-museums', u'egypt',
       u'cybersecurity'],
      dtype='object', length=679)

In [124]:
#tag_analysis = article_set.drop(total_tagged[total_tagged.num_tagged < 5].index,axis=1)
#I'm resetting tag_analysis to contain all tags so I can manipulate later whenever I want. It makes it more clear.
tag_analysis = article_set

In [98]:
print tag_analysis.shape
tag_analysis.head()


(2813, 354)
Out[98]:
100-wonders 31-days-of-halloween abandoned abandoned-amusement-parks abandoned-hospitals abandoned-insane-asylums abe-day aircraft airplanes airports ... world-s-fair world-s-oldest wunderkammer wwi wwii zombies published pageviews upper_quartile year
tagged_url
www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 2015-08-01 651.0 0 2015.0
www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 2015-06-09 3505.0 0 2015.0
www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 2014-05-12 840.0 0 2014.0
www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 2015-12-30 4037.0 0 2015.0
www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 2016-01-07 1620.0 0 2016.0

5 rows × 354 columns


In [60]:
tag_analysis.tail()
tag_analysis.to_csv('tag_analysis_ready.csv')

In [99]:
total_tagged.head(30)
print total_tagged.shape


(985, 1)

In [100]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(interaction_only=True)

In [101]:
poly_df = pd.DataFrame(poly.fit_transform(tag_analysis.fillna(0).drop(['published','pageviews','upper_quartile','year'],axis=1)))

In [102]:
poly.n_output_features_


Out[102]:
61426

In [103]:
total_tagged.ix['extra-mile']


Out[103]:
num_tagged    16.0
Name: extra-mile, dtype: float64

In [104]:
regular_features = ['places-you-can-no-longer-go','100-wonders','extra-mile','video-wonders','news','features','columns',
                    'found','animals','fleeting-wonders','visual','other-capitals-of-the-world','video','art','list','objects-of-intrigue',
                    'maps','morbid-monday','female-explorers','naturecultures']

In [125]:
total_tagged[total_tagged.num_tagged >10].shape


Out[125]:
(185, 1)

In [304]:
interactions = pd.DataFrame()

In [305]:
for item in regular_features:
    for column in tag_analysis.drop(['published','pageviews','upper_quartile','year'],axis=1).drop(
         total_tagged[total_tagged.num_tagged < 10].index,axis=1).columns:
        interactions[(item + '_' + column)] = tag_analysis[item] + tag_analysis[column]
#Just sum the row and column and then turn any 2s into 1s and 1s into zeros.

In [306]:
def correct_values(x):
    if x == 2.0:
        return 1
    elif x == 1.0:
        return 0
    else:
        return 0
for item in interactions.columns:
    interactions[item] = interactions[item].apply(correct_values)

In [307]:
interactions.head(2)


Out[307]:
places-you-can-no-longer-go_100-wonders places-you-can-no-longer-go_31-days-of-halloween places-you-can-no-longer-go_abandoned places-you-can-no-longer-go_abandoned-insane-asylums places-you-can-no-longer-go_aircraft places-you-can-no-longer-go_airplanes places-you-can-no-longer-go_amusement-parks places-you-can-no-longer-go_ancient places-you-can-no-longer-go_animal-week places-you-can-no-longer-go_animals ... naturecultures_volcanoes naturecultures_war naturecultures_water naturecultures_watery-wonders naturecultures_weird-weather-phenomena naturecultures_whales naturecultures_witchcraft naturecultures_women naturecultures_world-s-fair naturecultures_wwii
tagged_url
www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

2 rows × 3940 columns


In [308]:
tagged_total = pd.DataFrame(data =interactions.sum(),columns=['num_tagged'])
tagged_total = tagged_total.sort_values('num_tagged',ascending=False)

In [309]:
identity_tags = tagged_total[0:26].index

In [310]:
interactions = interactions.drop(identity_tags,axis=1)

In [311]:
tagged_total = pd.DataFrame(data =interactions.sum(),columns=['num_tagged'])
tagged_total = tagged_total.sort_values('num_tagged',ascending=False)
tagged_total.head(10)


Out[311]:
num_tagged
news_space 43
animals_features 38
news_animals 38
animals_news 38
features_animals 38
100-wonders_video 37
video_100-wonders 37
columns_features 35
features_columns 35
columns_map-monday 33

In [312]:
#DO I WANT TO DROP THE EMPTY COLUMNS?
#for item in interactions.columns:
 #   if interactions[item].sum == 0:
#      interactions = interactions.drop(item,axis=1)

In [313]:
interactions.head(10)


Out[313]:
places-you-can-no-longer-go_100-wonders places-you-can-no-longer-go_31-days-of-halloween places-you-can-no-longer-go_abandoned places-you-can-no-longer-go_abandoned-insane-asylums places-you-can-no-longer-go_aircraft places-you-can-no-longer-go_airplanes places-you-can-no-longer-go_amusement-parks places-you-can-no-longer-go_ancient places-you-can-no-longer-go_animal-week places-you-can-no-longer-go_animals ... naturecultures_volcanoes naturecultures_war naturecultures_water naturecultures_watery-wonders naturecultures_weird-weather-phenomena naturecultures_whales naturecultures_witchcraft naturecultures_women naturecultures_world-s-fair naturecultures_wwii
tagged_url
www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
www.atlasobscura.com/articles/100-wonders-an-island-you-dont-want-to-visit 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
www.atlasobscura.com/articles/100-wonders-battleship-island 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
www.atlasobscura.com/articles/100-wonders-blood-falls 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
www.atlasobscura.com/articles/100-wonders-clown-motel 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
www.atlasobscura.com/articles/100-wonders-desertron 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

10 rows × 3914 columns


In [314]:
interactions = interactions.join(pageviews)

In [315]:
#drop empty cols
def drop_zero_cols(df):
    for item in df.columns:
        if df[item].sum() == 0:
            df = df.drop(item,axis=1)
        else:
            continue
    return df

In [316]:
interactions = drop_zero_cols(interactions.fillna(0).drop(['published','pageviews'],axis=1))
interactions = interactions.join(pageviews)

In [317]:
interactions.head(1)


Out[317]:
places-you-can-no-longer-go_castles places-you-can-no-longer-go_cemeteries places-you-can-no-longer-go_cheat-week places-you-can-no-longer-go_escape-week places-you-can-no-longer-go_film places-you-can-no-longer-go_garbage places-you-can-no-longer-go_garbage-week places-you-can-no-longer-go_islands places-you-can-no-longer-go_japan places-you-can-no-longer-go_nazis ... naturecultures_science naturecultures_sounds naturecultures_space naturecultures_time-week naturecultures_transportation naturecultures_trees naturecultures_underground-week naturecultures_wwii published pageviews
tagged_url
www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 2015-08-01 651.0

1 rows × 1236 columns


In [318]:
interaction_totals = pd.DataFrame(interactions.sum().sort_values(ascending=False),columns=['num_tagged'])

In [345]:
interaction_totals[interaction_totals.num_tagged < 4].shape


Out[345]:
(1008, 1)

In [346]:
interactions_analysis = interactions.drop(interaction_totals[interaction_totals.num_tagged < 4].index,axis=1)

In [347]:
interactions_analysis.head()


Out[347]:
100-wonders_disaster-areas 100-wonders_disasters 100-wonders_science 100-wonders_video extra-mile_columns extra-mile_extra-mile video-wonders_animals video-wonders_australia video-wonders_sports news_airplanes ... morbid-monday_relics female-explorers_columns female-explorers_female-explorers female-explorers_kickass-women naturecultures_animals naturecultures_columns naturecultures_features naturecultures_naturecultures published pageviews
tagged_url
www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 2015-08-01 651.0
www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 2015-06-09 3505.0
www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 2014-05-12 840.0
www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 2015-12-30 4037.0
www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 2016-01-07 1620.0

5 rows × 228 columns


In [348]:
#Check whether number of Aggregated stories published per day has an impact on average/total Day 0 - 1 traffic.

In [349]:
from sklearn import linear_model
from sklearn import metrics
from sklearn import cross_validation

In [350]:
interactions_analysis['upper_quartile'] = [1 if x > 10000 else 0 for x in interactions.pageviews]

In [351]:
interactions_analysis['twenty_thousand'] = [1 if x > 20000 else 0 for x in interactions.pageviews]

In [352]:
y = interactions_analysis.upper_quartile
X = interactions_analysis.drop(['pageviews','published','upper_quartile','twenty_thousand'],axis=1)

In [353]:
kf = cross_validation.KFold(len(interactions_analysis),n_folds=5)
scores = []
for train_index, test_index in kf:
    lr = linear_model.LogisticRegression().fit(X.iloc[train_index],y.iloc[train_index])
    scores.append(lr.score(X.iloc[test_index],y.iloc[test_index]))
print "average accuracy for LogisticRegression is", np.mean(scores)
print "average of the set is: ", np.mean(y)


average accuracy for LogisticRegression is 0.846046535148
average of the set is:  0.151795236402

In [354]:
interactions_lr_scores = lr.predict_proba(X)[:,1]

In [355]:
print metrics.roc_auc_score(y,interactions_lr_scores)


0.632145261881

In [356]:
interactions_probabilities = pd.DataFrame(zip(X.columns,interactions_lr_scores),columns=['tags','probabilities'])

In [357]:
interactions_probabilities.sort_values('probabilities',ascending=False)


Out[357]:
tags probabilities
109 features_time-week 0.436462
53 features_animals 0.400695
57 features_birds 0.368227
149 animals_animal-week 0.296609
111 features_underground-week 0.283282
116 features_watery-wonders 0.283282
150 animals_birds 0.270520
209 maps_exploration 0.261088
110 features_tunnels 0.261088
24 news_features 0.183567
223 naturecultures_columns 0.178019
190 art_fleeting-wonders 0.175223
42 news_sculptures 0.169818
48 news_trees 0.169818
26 news_fossils 0.169818
158 animals_found 0.163157
163 animals_video 0.155569
138 found_archaeology 0.155189
118 features_women 0.154375
168 fleeting-wonders_food 0.151925
101 features_sculptures 0.148187
69 features_fashion 0.147209
171 fleeting-wonders_sports 0.141600
145 found_science 0.129710
146 found_shipwrecks 0.129710
147 found_space 0.129710
148 found_war 0.129710
154 animals_features 0.129710
151 animals_cats 0.129710
153 animals_dogs 0.129710
... ... ...
50 news_volcanoes 0.058617
49 news_underwater 0.058617
13 news_architecture 0.058617
14 news_art 0.058617
15 news_australia 0.058617
16 news_birds 0.058617
17 news_books 0.058617
37 news_oceans 0.058617
39 news_politics 0.058617
20 news_crime-and-punishment 0.058617
29 news_insects 0.058617
36 news_nasa 0.058617
40 news_religion 0.058617
34 news_music 0.058617
41 news_science 0.058617
32 news_literature 0.058617
31 news_japan 0.058617
30 news_islands 0.058617
43 news_shipwrecks 0.058617
25 news_food 0.058617
44 news_snakes 0.058617
45 news_space 0.058617
66 features_crime-and-punishment 0.053248
105 features_sports 0.040151
9 news_airplanes 0.040049
46 news_sports 0.040049
27 news_garbage-week 0.040049
23 news_dogs 0.040049
18 news_churches 0.040049
143 found_maps 0.036773

226 rows × 2 columns


In [475]:
interaction_totals.head(2)


Out[475]:
num_tagged
pageviews 21941190.0
news_space 43.0

In [469]:
def split_tag(x):
    return x.split('_')[1]
interactions_probabilities = interactions_probabilities.reset_index()
interactions_probabilities['subtag'] = interactions_probabilities.tags.apply(split_tag)

In [477]:
interactions_probabilities = interactions_probabilities.sort_values(['tags','probabilities'],ascending=[1, 0])

In [471]:
interactions_probabilities = interactions_probabilities.set_index('tags').join(interaction_totals)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-471-09b7b5167df0> in <module>()
----> 1 interactions_probabilities = interactions_probabilities.set_index('tags').join(interaction_totals)

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in join(self, other, on, how, lsuffix, rsuffix, sort)
   4367         # For SparseDataFrame's benefit
   4368         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 4369                                  rsuffix=rsuffix, sort=sort)
   4370 
   4371     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   4381             return merge(self, other, left_on=on, how=how,
   4382                          left_index=on is None, right_index=True,
-> 4383                          suffixes=(lsuffix, rsuffix), sort=sort)
   4384         else:
   4385             if on is not None:

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)
     33                          right_index=right_index, sort=sort, suffixes=suffixes,
     34                          copy=copy, indicator=indicator)
---> 35     return op.get_result()
     36 if __debug__:
     37     merge.__doc__ = _merge_doc % '\nleft : DataFrame'

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in get_result(self)
    210 
    211         llabels, rlabels = items_overlap_with_suffix(ldata.items, lsuf,
--> 212                                                      rdata.items, rsuf)
    213 
    214         lindexers = {1: left_indexer} if left_indexer is not None else {}

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in items_overlap_with_suffix(left, lsuffix, right, rsuffix)
   4372         if not lsuffix and not rsuffix:
   4373             raise ValueError('columns overlap but no suffix specified: %s' %
-> 4374                              to_rename)
   4375 
   4376         def lrenamer(x):

ValueError: columns overlap but no suffix specified: Index([u'num_tagged'], dtype='object')

In [478]:
interactions_probabilities


Out[478]:
tags probabilities subtag num_tagged
184 100-wonders_disaster-areas 0.099160 disaster-areas 6.0
23 100-wonders_disasters 0.129710 disasters 6.0
24 100-wonders_science 0.129710 science 5.0
25 100-wonders_video 0.129710 video 37.0
3 animals_animal-week 0.296609 animal-week 10.0
6 animals_birds 0.270520 birds 8.0
26 animals_cats 0.129710 cats 8.0
176 animals_columns 0.121526 columns 8.0
27 animals_dogs 0.129710 dogs 7.0
28 animals_features 0.129710 features 38.0
29 animals_fleeting-wonders 0.129710 fleeting-wonders 12.0
30 animals_food 0.129710 food 4.0
31 animals_fossils 0.129710 fossils 4.0
15 animals_found 0.163157 found 21.0
32 animals_list 0.129710 list 8.0
187 animals_naturecultures 0.085154 naturecultures 6.0
188 animals_news 0.082413 news 38.0
33 animals_oceans 0.129710 oceans 7.0
16 animals_video 0.155569 video 8.0
178 animals_video-wonders 0.116927 video-wonders 6.0
34 art_columns 0.129710 columns 6.0
191 art_features 0.077260 features 12.0
11 art_fleeting-wonders 0.175223 fleeting-wonders 4.0
35 art_libraries 0.129710 libraries 5.0
36 art_museums 0.129710 museums 4.0
37 art_museums-and-collections 0.129710 museums-and-collections 4.0
38 art_news 0.129710 news 11.0
39 art_sculptures 0.129710 sculptures 6.0
40 art_visual 0.129710 visual 16.0
41 columns_animals 0.129710 animals 8.0
... ... ... ... ...
212 news_snakes 0.058617 snakes 5.0
213 news_space 0.058617 space 43.0
224 news_sports 0.040049 sports 7.0
151 news_statues 0.129710 statues 5.0
14 news_trees 0.169818 trees 5.0
214 news_underwater 0.058617 underwater 7.0
215 news_volcanoes 0.058617 volcanoes 5.0
152 news_war 0.129710 war 4.0
153 news_water 0.129710 water 4.0
154 objects-of-intrigue_features 0.129710 features 7.0
155 objects-of-intrigue_space 0.129710 space 5.0
156 other-capitals-of-the-world_other-capitals-of-... 0.129710 other-capitals-of-the-world 12.0
216 video-wonders_animals 0.058617 animals 6.0
217 video-wonders_australia 0.058617 australia 4.0
163 video-wonders_sports 0.129710 sports 5.0
157 video_100-wonders 0.129710 100-wonders 37.0
158 video_animals 0.129710 animals 8.0
159 video_disaster-areas 0.129710 disaster-areas 5.0
160 video_disasters 0.129710 disasters 5.0
161 video_science 0.129710 science 5.0
162 video_sports 0.129710 sports 5.0
164 visual_abandoned 0.129710 abandoned 6.0
165 visual_architecture 0.129710 architecture 9.0
166 visual_art 0.129710 art 16.0
179 visual_features 0.110809 features 12.0
190 visual_list 0.077772 list 22.0
167 visual_photo-of-the-week 0.129710 photo-of-the-week 10.0
168 visual_photography 0.129710 photography 20.0
169 visual_soviet 0.129710 soviet 6.0
170 visual_space 0.129710 space 9.0

226 rows × 4 columns


In [ ]:


In [ ]:


In [567]:
interactions_probabilities['pageviews'] = [sum(interactions['pageviews'][interactions[item]==1]) for item in interactions_probabilities.tags]

In [570]:
interactions_probabilities['mean-PVs'] = interactions_probabilities['pageviews'] // interactions_probabilities['num_tagged']

In [579]:
regular_features


Out[579]:
['places you can no longer go',
 '100 wonders',
 'extra mile',
 'video wonders',
 'news',
 'features',
 'columns',
 'found',
 'animals',
 'fleeting wonders',
 'visual',
 'other capitals of the world',
 'video',
 'art',
 'list',
 'objects of intrigue',
 'maps',
 'morbid monday',
 'female explorers',
 'naturecultures']

In [623]:
interactions_probabilities[interactions_probabilities.tags.str.contains('features')==True].sort_values('mean-PVs',
                                                                                                   ascending = False)


Out[623]:
tags probabilities subtag num_tagged pageviews mean-PVs
75 features_linguistics 0.129710 linguistics 4.0 207623.0 51905.0
82 features_miracles-week 0.129710 miracles-week 8.0 390486.0 48810.0
63 features_computers 0.129710 computers 7.0 263227.0 37603.0
91 features_plants 0.129710 plants 5.0 145866.0 29173.0
66 features_film 0.129710 film 14.0 384072.0 27433.0
100 features_television 0.129710 television 8.0 200192.0 25024.0
191 art_features 0.077260 features 12.0 284906.0 23742.0
192 features_art 0.061562 art 12.0 284906.0 23742.0
74 features_language 0.129710 language 4.0 87118.0 21779.0
0 features_time-week 0.436462 time-week 10.0 171105.0 17110.0
101 features_video-games 0.129710 video-games 9.0 151971.0 16885.0
95 features_science-fiction 0.129710 science-fiction 4.0 66197.0 16549.0
106 features_wwii 0.129710 wwii 6.0 84399.0 14066.0
58 features_books 0.129710 books 8.0 107780.0 13472.0
56 features_architecture 0.129710 architecture 6.0 69250.0 11541.0
5 features_watery-wonders 0.283282 watery-wonders 4.0 43674.0 10918.0
87 features_naturecultures 0.129710 naturecultures 26.0 279450.0 10748.0
173 naturecultures_features 0.128812 features 26.0 279450.0 10748.0
85 features_murder 0.129710 murder 4.0 42335.0 10583.0
68 features_games 0.129710 games 8.0 84261.0 10532.0
59 features_cats 0.129710 cats 6.0 59547.0 9924.0
62 features_columns 0.129710 columns 35.0 339985.0 9713.0
46 columns_features 0.129710 features 35.0 339985.0 9713.0
81 features_military 0.129710 military 6.0 55619.0 9269.0
185 features_religion 0.098450 religion 5.0 44193.0 8838.0
177 features_space 0.119270 space 8.0 67853.0 8481.0
180 features_crime 0.107340 crime 6.0 49116.0 8186.0
4 features_underground-week 0.283282 underground-week 8.0 65102.0 8137.0
77 features_literature 0.129710 literature 9.0 68614.0 7623.0
102 features_visual 0.129710 visual 12.0 89421.0 7451.0
... ... ... ... ... ... ...
84 features_monsters 0.129710 monsters 6.0 35139.0 5856.0
105 features_witchcraft 0.129710 witchcraft 4.0 23395.0 5848.0
73 features_kickass-women 0.129710 kickass-women 5.0 29003.0 5800.0
21 features_fashion 0.147209 fashion 5.0 28595.0 5719.0
94 features_science 0.129710 science 15.0 82426.0 5495.0
64 features_dinosaurs 0.129710 dinosaurs 4.0 21218.0 5304.0
2 features_birds 0.368227 birds 9.0 47373.0 5263.0
88 features_new-york-city 0.129710 new-york-city 6.0 31498.0 5249.0
86 features_music 0.129710 music 12.0 61632.0 5136.0
80 features_medicine 0.129710 medicine 5.0 21993.0 4398.0
104 features_water 0.129710 water 7.0 30590.0 4370.0
69 features_garbage 0.129710 garbage 4.0 17274.0 4318.0
9 news_features 0.183567 features 6.0 25507.0 4251.0
89 features_news 0.129710 news 6.0 25507.0 4251.0
20 features_sculptures 0.148187 sculptures 5.0 20346.0 4069.0
99 features_technology 0.129710 technology 4.0 15892.0 3973.0
57 features_birdweek 0.129710 birdweek 9.0 35752.0 3972.0
98 features_statues 0.129710 statues 4.0 15627.0 3906.0
60 features_cheat-week 0.129710 cheat-week 8.0 31149.0 3893.0
103 features_war 0.129710 war 5.0 18591.0 3718.0
92 features_politics 0.129710 politics 14.0 50899.0 3635.0
219 features_sports 0.040151 sports 13.0 40004.0 3077.0
93 features_presidents 0.129710 presidents 4.0 12032.0 3008.0
7 features_tunnels 0.261088 tunnels 4.0 11691.0 2922.0
61 features_china 0.129710 china 9.0 17069.0 1896.0
71 features_halloween 0.129710 halloween 4.0 6578.0 1644.0
97 features_sounds 0.129710 sounds 4.0 5630.0 1407.0
96 features_snow 0.129710 snow 4.0 4647.0 1161.0
90 features_objects-of-intrigue 0.129710 objects-of-intrigue 7.0 NaN NaN
154 objects-of-intrigue_features 0.129710 features 7.0 NaN NaN

76 rows × 6 columns


In [ ]:
interactions_probabilities.sort_values('probabilities',ascending = False)

In [625]:
np.mean(interactions.pageviews)


Out[625]:
7938.2018813314035

In [620]:
#I took the dashes out. Have to add back for this
fix_regular_features = [x.replace(' ','-') for x in regular_features]
fig,axes=plt.subplots(figsize=(10,10))
for item, name in enumerate(fix_regular_features):
    interactions.plot(x=interactions['pageviews'][interactions.columns.str.contains(name)==True],kind='box',ax=item)
plt.show()


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-620-866e8988a9c8> in <module>()
      3 fig,axes=plt.subplots(figsize=(10,10))
      4 for item, name in enumerate(fix_regular_features):
----> 5     interactions.plot(x=interactions['pageviews'][interactions.columns.str.contains(name)==True],kind='boxplot',ax=item)
      6 plt.show()

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.pyc in __call__(self, x, y, kind, ax, subplots, sharex, sharey, layout, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, secondary_y, sort_columns, **kwds)
   3735                           fontsize=fontsize, colormap=colormap, table=table,
   3736                           yerr=yerr, xerr=xerr, secondary_y=secondary_y,
-> 3737                           sort_columns=sort_columns, **kwds)
   3738     __call__.__doc__ = plot_frame.__doc__
   3739 

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.pyc in plot_frame(data, x, y, kind, ax, subplots, sharex, sharey, layout, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, secondary_y, sort_columns, **kwds)
   2609                  yerr=yerr, xerr=xerr,
   2610                  secondary_y=secondary_y, sort_columns=sort_columns,
-> 2611                  **kwds)
   2612 
   2613 

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.pyc in _plot(data, x, y, subplots, ax, kind, **kwds)
   2388         klass = _plot_klass[kind]
   2389     else:
-> 2390         raise ValueError("%r is not a valid plot kind" % kind)
   2391 
   2392     from pandas import DataFrame

ValueError: 'boxplot' is not a valid plot kind

In [453]:
#doublecheck my work on pageviews vs num-published
pub_volume = tag_analysis[['published','pageviews']]
pub_volume['num_pubbed'] = 1
pub_volume['published'] = pd.to_datetime(pub_volume.published)
pub_volume = pub_volume.set_index('published')


/Users/Mike/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
/Users/Mike/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [454]:
pub_volume.head(10)


Out[454]:
pageviews num_pubbed
published
2015-08-01 651.0 1
2015-06-09 3505.0 1
2014-05-12 840.0 1
2015-12-30 4037.0 1
2016-01-07 1620.0 1
2015-08-20 4049.0 1
2015-12-17 2727.0 1
2015-09-15 1290.0 1
2015-10-21 1450.0 1
2015-12-03 2635.0 1

In [455]:
pub_volume = pub_volume.resample('M').sum().dropna()

In [456]:
pub_volume['year'] = pub_volume.index.year

In [457]:
pub_volume[pub_volume.index.year >=2015].corr()


Out[457]:
pageviews num_pubbed year
pageviews 1.000000 0.926886 0.506975
num_pubbed 0.926886 1.000000 0.650054
year 0.506975 0.650054 1.000000

In [458]:
pub_volume[pub_volume.index.year >=2015].plot(kind='scatter',x='num_pubbed',y='pageviews')


Out[458]:
<matplotlib.axes._subplots.AxesSubplot at 0x138e54750>

In [459]:
import seaborn as sns
ax = sns.regplot(x='num_pubbed',y='pageviews',data=pub_volume)


Now I'm going to try this with the time series 30days pvs


In [446]:
#doublecheck my work on pageviews vs num-published
pub_volume = time_series[['published','7-day-PVs']]
pub_volume['num_pubbed'] = 1
pub_volume['published'] = pd.to_datetime(pub_volume.published)
pub_volume = pub_volume.set_index('published')


/Users/Mike/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
/Users/Mike/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [447]:
pub_volume.head(10)


Out[447]:
7-day-PVs num_pubbed
published
2015-08-01 662.0 1
2015-06-09 2107.0 1
2014-05-12 1632.0 1
2015-12-30 3650.0 1
2016-01-07 1681.0 1
2015-08-20 624.0 1
2015-12-17 2442.0 1
2015-09-15 749.0 1
2015-10-21 1590.0 1
2015-12-03 1483.0 1

In [448]:
num_holder = pub_volume.resample('D').sum().dropna().drop('7-day-PVs',axis=1)
pub_volume = pub_volume.resample('D').sum().dropna().drop('num_pubbed',axis=1)
pub_volume = pub_volume.join(num_holder)
pub_volume['year'] = pub_volume.index.year
pub_volume[pub_volume.index.year >=2015].corr()


Out[448]:
7-day-PVs num_pubbed year
7-day-PVs 1.000000 0.466825 0.158331
num_pubbed 0.466825 1.000000 0.380993
year 0.158331 0.380993 1.000000

In [451]:
pub_volume[pub_volume.index >='2016-01-01'].plot(kind='scatter',x='num_pubbed',y='7-day-PVs',title='7-Day PVs')


Out[451]:
<matplotlib.axes._subplots.AxesSubplot at 0x137c1f5d0>

In [452]:
import seaborn as sns
ax = sns.regplot(x='num_pubbed',y='7-day-PVs',data=pub_volume)


Let's check average performance when just looking at Simplereach Tag data


In [540]:
simplereach = pd.read_csv('simplereach-tags.csv')

In [541]:
simplereach.head(1)


Out[541]:
Tag Page Views Social Actions Social Referrals Facebook Actions Facebook CommentsBox Facebook Likes Facebook Shares Facebook Comments Twitter Actions ... Desktop Reddit Referrals Mobile Delicious Referrals Tablet Delicious Referrals Desktop Delicious Referrals Mobile Pinterest Referrals Tablet Pinterest Referrals Desktop Pinterest Referrals Mobile Google Plus Referrals Tablet Google Plus Referrals Desktop Google Plus Referrals
0 features 5009360 716287 2337637 659298 0 431705 126757 100830 46756 ... 163987 21 6 248 808 413 696 1523 428 2179

1 rows × 59 columns


In [542]:
simplereach = simplereach.set_index('Tag')

In [546]:
total_tagged2 = total_tagged

In [547]:
total_tagged2.head(4)


Out[547]:
num_tagged
year 5568818.0
news 461.0
upper_quartile 427.0
features 356.0

In [548]:
total_tagged2.index = [x.replace('-',' ') for x in total_tagged.index]

simplereach = simplereach.join(total_tagged2)

In [549]:
simplereach['mean-PVs'] = simplereach['Page Views'] // simplereach['num_tagged']
simplereach['mean-shares'] = simplereach['Facebook Shares'] // simplereach['num_tagged']

In [550]:
simplereach = simplereach[['mean-PVs','mean-shares','num_tagged']]

In [622]:
simplereach[simplereach['num_tagged'] > 5].sort_values('mean-PVs',ascending=False)


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-622-46d81dbbd256> in <module>()
----> 1 simplereach['space'][(simplereach['num_tagged'] > 5)].sort_values('mean-PVs',ascending=False)

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1990             return self._getitem_multilevel(key)
   1991         else:
-> 1992             return self._getitem_column(key)
   1993 
   1994     def _getitem_column(self, key):

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1997         # get column
   1998         if self.columns.is_unique:
-> 1999             return self._get_item_cache(key)
   2000 
   2001         # duplicate columns & possible reduce dimensionality

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1343         res = cache.get(item)
   1344         if res is None:
-> 1345             values = self._data.get(item)
   1346             res = self._box_item_values(item, values)
   1347             cache[item] = res

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   3223 
   3224             if not isnull(item):
-> 3225                 loc = self.items.get_loc(item)
   3226             else:
   3227                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
   1876                 return self._engine.get_loc(key)
   1877             except KeyError:
-> 1878                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1879 
   1880         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)()

KeyError: 'space'

In [554]:
#regular_features = [x.replace('-',' ') for x in regular_features]
simplereach.ix[regular_features].sort_values('mean-PVs',ascending=False)


Out[554]:
mean-PVs mean-shares num_tagged
Tag
maps 18568.0 545.0 60.0
naturecultures 14334.0 314.0 27.0
features 14071.0 356.0 356.0
other capitals of the world 13825.0 391.0 12.0
visual 12602.0 435.0 117.0
video wonders 12324.0 329.0 47.0
video 10834.0 231.0 102.0
list 8637.0 195.0 85.0
news 8146.0 225.0 461.0
extra mile 7989.0 248.0 16.0
found 7756.0 241.0 167.0
columns 7465.0 222.0 212.0
female explorers 6573.0 343.0 15.0
animals 6086.0 259.0 165.0
100 wonders 5687.0 90.0 44.0
objects of intrigue 5029.0 212.0 79.0
fleeting wonders 4452.0 115.0 156.0
art 4317.0 139.0 92.0
places you can no longer go 3948.0 38.0 44.0
morbid monday 1537.0 32.0 45.0

In [ ]:


In [ ]:

Let's run some regression analysis on our tag_analysis DataFrame


In [135]:
from sklearn import linear_model

In [136]:
from sklearn import metrics

In [137]:
tag_analysis.fillna(value=0,inplace=True)

In [138]:
y = tag_analysis.upper_quartile
X = tag_analysis.drop(['pageviews','published','upper_quartile'],axis=1)

In [139]:
from sklearn import cross_validation

In [140]:
kf = cross_validation.KFold(len(tag_analysis),n_folds=5)
scores = []
for train_index, test_index in kf:
    lr = linear_model.LogisticRegression().fit(X.iloc[train_index],y.iloc[train_index])
    scores.append(lr.score(X.iloc[test_index],y.iloc[test_index]))
print "average accuracy for LogisticRegression is", np.mean(scores)
print "average of the set is: ", np.mean(y)


average accuracy for LogisticRegression is 0.847468758494
average of the set is:  0.151795236402

In [141]:
lr_scores = lr.predict_proba(X)[:,1]

In [142]:
print metrics.roc_auc_score(y,lr_scores)


0.672637614814

In [143]:
print metrics.roc_auc_score(y,lr_scores)


0.672637614814

In [144]:
lr_scores


Out[144]:
array([ 0.176529  ,  0.14691911,  0.14657648, ...,  0.14657648,
        0.23156819,  0.14657648])

In [145]:
coefficients = pd.DataFrame(zip(X.columns,lr.coef_[0]),columns=['tags','coefficients'])
probabilities = pd.DataFrame(zip(X.columns,lr_scores),columns=['tags','probabilities'])

In [146]:
probabilities.sort_values('probabilities',ascending=False)


Out[146]:
tags probabilities
33 news 0.189916
0 100-wonders 0.176529
1 31-days-of-halloween 0.146919
8 cemeteries 0.146576
51 war 0.146576
21 garbage-week 0.146576
10 churches 0.146576
52 wwii 0.146576
3 animals 0.146576
2 abandoned 0.146576
24 libraries 0.091183
11 columns 0.091183
47 underground-week 0.091183
22 holidays 0.091183
38 places-you-can-no-longer-go 0.090890
12 crime-and-punishment 0.090599
35 objects-of-intrigue 0.089517
39 politics 0.084188
6 birds 0.061686
9 cheat-week 0.059556
5 art 0.059297
36 obscura-day 0.056281
37 oceans 0.056281
7 books 0.056281
40 rites-and-rituals 0.056281
32 music 0.056281
43 society-adventures 0.056281
44 space 0.056281
48 video 0.056281
49 video-wonders 0.056281
50 visual 0.056281
41 ruins 0.056281
30 morbid-monday 0.056281
19 food 0.056281
25 list 0.056281
13 curious-fact-of-the-week 0.056281
14 death 0.056281
16 features 0.056281
20 found 0.056281
26 magic 0.056281
4 architecture 0.056281
28 maps 0.056281
29 medicine 0.056281
31 museums 0.051961
15 exploration 0.040709
27 map-monday 0.037582
42 science 0.033551
23 islands 0.033012
18 fleeting-wonders 0.033012
34 notes-from-the-field 0.029644
45 sports 0.029644
17 film 0.022731
46 transportation 0.020998

In [147]:
coefficients.sort_values('coefficients',ascending=False)


Out[147]:
tags coefficients
52 wwii 1.065043
28 maps 0.848714
7 books 0.836221
47 underground-week 0.575863
29 medicine 0.567487
2 abandoned 0.562231
9 cheat-week 0.517251
24 libraries 0.385578
30 morbid-monday 0.327623
4 architecture 0.303256
49 video-wonders 0.293608
25 list 0.234942
16 features 0.221080
11 columns 0.204949
31 museums 0.159610
37 oceans 0.134381
10 churches 0.129452
14 death 0.102822
41 ruins 0.055650
32 music 0.055465
21 garbage-week 0.013623
3 animals 0.002736
5 art 0.000338
19 food -0.007075
50 visual -0.013276
17 film -0.021714
1 31-days-of-halloween -0.023128
20 found -0.043365
51 war -0.084437
46 transportation -0.102804
27 map-monday -0.204296
22 holidays -0.280851
26 magic -0.340270
33 news -0.403831
35 objects-of-intrigue -0.430410
23 islands -0.464792
12 crime-and-punishment -0.488901
48 video -0.520194
0 100-wonders -0.537564
44 space -0.541108
42 science -0.557842
18 fleeting-wonders -0.601600
8 cemeteries -0.668934
39 politics -0.709310
36 obscura-day -0.759159
40 rites-and-rituals -0.922080
6 birds -0.941553
15 exploration -0.959669
34 notes-from-the-field -0.964113
43 society-adventures -1.238194
45 sports -1.278518
13 curious-fact-of-the-week -1.470357
38 places-you-can-no-longer-go -1.730003

In [148]:
tag_analysis[tag_analysis['100-wonders'] ==1].describe()


Out[148]:
100-wonders 31-days-of-halloween abandoned animals architecture art birds books cemeteries cheat-week ... sports transportation underground-week video video-wonders visual war wwii pageviews upper_quartile
count 44.0 44.0 44.000000 44.0 44.000000 44.0 44.000000 44.0 44.000000 44.0 ... 44.0 44.0 44.0 44.000000 44.0 44.0 44.000000 44.0 44.000000 44.000000
mean 1.0 0.0 0.045455 0.0 0.022727 0.0 0.022727 0.0 0.045455 0.0 ... 0.0 0.0 0.0 0.840909 0.0 0.0 0.022727 0.0 3817.977273 0.068182
std 0.0 0.0 0.210707 0.0 0.150756 0.0 0.150756 0.0 0.210707 0.0 ... 0.0 0.0 0.0 0.369989 0.0 0.0 0.150756 0.0 3259.347741 0.254972
min 1.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 792.000000 0.000000
25% 1.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 ... 0.0 0.0 0.0 1.000000 0.0 0.0 0.000000 0.0 1737.750000 0.000000
50% 1.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 ... 0.0 0.0 0.0 1.000000 0.0 0.0 0.000000 0.0 2771.000000 0.000000
75% 1.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 ... 0.0 0.0 0.0 1.000000 0.0 0.0 0.000000 0.0 4522.750000 0.000000
max 1.0 0.0 1.000000 0.0 1.000000 0.0 1.000000 0.0 1.000000 0.0 ... 0.0 0.0 0.0 1.000000 0.0 0.0 1.000000 0.0 16769.000000 1.000000

8 rows × 55 columns


In [149]:
tag_analysis.head()


Out[149]:
100-wonders 31-days-of-halloween abandoned animals architecture art birds books cemeteries cheat-week ... transportation underground-week video video-wonders visual war wwii published pageviews upper_quartile
tagged_url
www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 2015-08-01 651.0 0
www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-06-09 3505.0 0
www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2014-05-12 840.0 0
www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2015-12-30 4037.0 0
www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 2016-01-07 1620.0 0

5 rows × 56 columns

Now let's try it with KNN


In [150]:
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

In [ ]:
params = {'n_neighbors': [x for x in range(2,200,1)],
          'weights': ['distance','uniform']}
gs = GridSearchCV(estimator=KNeighborsClassifier(),param_grid=params,n_jobs=8,cv=10)
gs.fit(X,y)
print gs.best_params_
print gs.best_score_

In [160]:
print type(gs.best_estimator_)


<class 'sklearn.neighbors.classification.KNeighborsClassifier'>

In [161]:
knn = gs.best_estimator_.fit(X,y)

In [162]:
knn_scores = knn.predict_proba(X)[:,1]

In [163]:
print np.mean(knn_scores)


0.160428622213

In [164]:
print np.mean(lr_scores)


0.134654700331

In [165]:
knn_probabilities = pd.DataFrame(zip(X.columns,knn_scores),columns=['tags','probabilities'])

In [166]:
knn_probabilities.sort_values('probabilities',ascending=False)


Out[166]:
tags probabilities
52 wwii 0.250000
8 cemeteries 0.250000
2 abandoned 0.250000
3 animals 0.250000
51 war 0.250000
21 garbage-week 0.250000
10 churches 0.250000
33 news 0.214286
0 100-wonders 0.214286
12 crime-and-punishment 0.178571
35 objects-of-intrigue 0.142857
42 science 0.107143
9 cheat-week 0.071429
5 art 0.071429
34 notes-from-the-field 0.071429
36 obscura-day 0.071429
37 oceans 0.071429
39 politics 0.071429
40 rites-and-rituals 0.071429
41 ruins 0.071429
43 society-adventures 0.071429
7 books 0.071429
44 space 0.071429
45 sports 0.071429
46 transportation 0.071429
47 underground-week 0.071429
48 video 0.071429
49 video-wonders 0.071429
50 visual 0.071429
4 architecture 0.071429
32 music 0.071429
31 museums 0.071429
30 morbid-monday 0.071429
20 found 0.071429
11 columns 0.071429
13 curious-fact-of-the-week 0.071429
14 death 0.071429
15 exploration 0.071429
16 features 0.071429
17 film 0.071429
18 fleeting-wonders 0.071429
19 food 0.071429
6 birds 0.071429
29 medicine 0.071429
22 holidays 0.071429
23 islands 0.071429
24 libraries 0.071429
25 list 0.071429
27 map-monday 0.071429
28 maps 0.071429
26 magic 0.071429
38 places-you-can-no-longer-go 0.035714
1 31-days-of-halloween 0.000000

Let's check the roc_auc scores for both the knn and logistic regression models.


In [167]:
print 'knn', metrics.roc_auc_score(y,knn_scores)
print 'lr', metrics.roc_auc_score(y,lr_scores)


knn 0.672067348369
lr 0.672637614814

Looks like they give similar scores, but the scores are easily manipulated by changing the threshold for the number of articles per tag and by changing the threshold for "success" (currently set at > 10,000 Pageviews).


In [62]:
probabilities = probabilities.set_index('tags')

In [63]:
probabilities = probabilities.join(total_tagged)

In [64]:
probabilities.to_csv('tag-probabilities-logisticregression.csv')

Now let's try RandomForest


In [65]:
from sklearn.ensemble import RandomForestClassifier

In [66]:
params = {'max_depth': np.arange(20,100,2),
          'min_samples_leaf': np.arange(90,200,2),
          'n_estimators': 20}
gs1 = GridSearchCV(RandomForestClassifier(),param_grid=params, cv=10, scoring='roc_auc',n_jobs=8,verbose=1)
gs1.fit(X,y)
print gs1.best_params_
print gs1.best_score_


Fitting 10 folds for each of 4400 candidates, totalling 44000 fits
[Parallel(n_jobs=8)]: Done  52 tasks      | elapsed:    0.6s
[Parallel(n_jobs=8)]: Done 352 tasks      | elapsed:    2.7s
[Parallel(n_jobs=8)]: Done 852 tasks      | elapsed:    6.6s
[Parallel(n_jobs=8)]: Done 1552 tasks      | elapsed:   12.5s
[Parallel(n_jobs=8)]: Done 2452 tasks      | elapsed:   21.4s
[Parallel(n_jobs=8)]: Done 3552 tasks      | elapsed:   31.1s
[Parallel(n_jobs=8)]: Done 4852 tasks      | elapsed:   43.3s
[Parallel(n_jobs=8)]: Done 6352 tasks      | elapsed:   57.3s
[Parallel(n_jobs=8)]: Done 8052 tasks      | elapsed:  1.2min
[Parallel(n_jobs=8)]: Done 9952 tasks      | elapsed:  1.5min
[Parallel(n_jobs=8)]: Done 12052 tasks      | elapsed:  1.9min
[Parallel(n_jobs=8)]: Done 14352 tasks      | elapsed:  2.2min
[Parallel(n_jobs=8)]: Done 16852 tasks      | elapsed:  2.6min
[Parallel(n_jobs=8)]: Done 19552 tasks      | elapsed:  3.0min
[Parallel(n_jobs=8)]: Done 22452 tasks      | elapsed:  3.5min
[Parallel(n_jobs=8)]: Done 25552 tasks      | elapsed:  4.0min
[Parallel(n_jobs=8)]: Done 28852 tasks      | elapsed:  4.5min
[Parallel(n_jobs=8)]: Done 32352 tasks      | elapsed:  5.0min
[Parallel(n_jobs=8)]: Done 36052 tasks      | elapsed:  5.6min
[Parallel(n_jobs=8)]: Done 39952 tasks      | elapsed:  6.3min
[Parallel(n_jobs=8)]: Done 44000 out of 44000 | elapsed:  6.9min finished
{'max_depth': 95, 'min_samples_leaf': 116}
0.568912853263

In [67]:
rf = RandomForestClassifier(gs1.best_estimator_)
rf.fit(X,y)
probs = rf.predict_proba(X)[:,1]
print rf.score(X,y)
print metrics.roc_auc_score(y,probs)


0.753999289015
0.551845977331

In [69]:
probs = pd.DataFrame(zip(X.columns,probs),columns=['tags','probabilities'])

In [71]:
probs.sort_values('probabilities',ascending=False)


Out[71]:
tags probabilities
0 100-wonders 0.250698
27 map-monday 0.250698
29 medicine 0.250698
30 morbid-monday 0.250698
31 museums 0.250698
32 music 0.250698
33 news 0.250698
34 notes-from-the-field 0.250698
35 objects-of-intrigue 0.250698
36 obscura-day 0.250698
37 oceans 0.250698
38 places-you-can-no-longer-go 0.250698
39 politics 0.250698
40 rites-and-rituals 0.250698
41 ruins 0.250698
42 science 0.250698
43 society-adventures 0.250698
44 space 0.250698
45 sports 0.250698
46 transportation 0.250698
47 underground-week 0.250698
48 video 0.250698
49 video-wonders 0.250698
50 visual 0.250698
51 war 0.250698
28 maps 0.250698
26 magic 0.250698
1 31-days-of-halloween 0.250698
25 list 0.250698
2 abandoned 0.250698
3 animals 0.250698
4 architecture 0.250698
5 art 0.250698
6 birds 0.250698
7 books 0.250698
8 cemeteries 0.250698
9 cheat-week 0.250698
10 churches 0.250698
11 columns 0.250698
12 crime-and-punishment 0.250698
13 curious-fact-of-the-week 0.250698
14 death 0.250698
15 exploration 0.250698
16 features 0.250698
17 film 0.250698
18 fleeting-wonders 0.250698
19 food 0.250698
20 found 0.250698
21 garbage-week 0.250698
22 holidays 0.250698
23 islands 0.250698
24 libraries 0.250698
52 wwii 0.250698

Let's try the Logistic Model but with more tags


In [144]:
tag_analysis2 = article_set.drop(total_tagged[total_tagged.num_tagged < 15].index,axis=1)

In [190]:
tag_analysis2['ten_thousand'] = [1 if x > 10000 else 0 for x in tag_analysis2.pageviews]

In [191]:
tag_analysis2.fillna(value=0,inplace=True)
y2 = tag_analysis2.ten_thousand
X2 = tag_analysis2.drop(['pageviews','upper_quartile','ten_thousand'],axis=1)

In [192]:
kf2 = cross_validation.KFold(len(tag_analysis2),n_folds=5)
scores2 = []
for train_index, test_index in kf2:
    lr2 = linear_model.LogisticRegression().fit(X2.iloc[train_index],y2.iloc[train_index])
    scores2.append(lr2.score(X2.iloc[test_index],y2.iloc[test_index]))
print "average accuracy for LogisticRegression is", np.mean(scores2)
print "average of the set is: ", np.mean(y2)


average accuracy for LogisticRegression is 0.846403039133
average of the set is:  0.151795236402

In [193]:
print tag_analysis2.shape
print y2.shape
print X2.shape


(2813, 136)
(2813,)
(2813, 133)

In [194]:
lr_scores2 = lr2.predict_proba(X2)[:,1]

In [195]:
lr2_probs = pd.DataFrame(zip(X2.columns,lr_scores2),columns=['tags','probabilities'])

In [196]:
lr2_probs.sort_values('probabilities',ascending=False)


Out[196]:
tags probabilities
130 women 0.371218
3 aircraft 0.330070
125 visual 0.313056
110 sports 0.303625
71 mummies 0.288925
99 saints 0.272294
115 time-week 0.263755
98 ruins 0.241347
109 space 0.235258
100 science 0.235128
70 mountains 0.233971
103 ships 0.215629
119 tunnels 0.211667
118 trees 0.207395
111 statues 0.204833
116 trains 0.204833
60 magic 0.197443
107 society-adventures 0.191618
0 100-wonders 0.188841
57 libraries 0.182539
112 subterranean 0.182539
132 wwii 0.166283
131 world-s-fair 0.152512
105 skeletons 0.147225
53 insects 0.145729
61 map-monday 0.142249
77 nasa 0.141422
82 no-ones-watching-week 0.141422
80 new-york-city 0.141422
79 naturecultures 0.141422
... ... ...
5 ancient 0.043366
28 design 0.043092
9 architecture 0.042018
52 infrastructure 0.041597
6 animals 0.041193
32 escape-week 0.039818
13 books 0.037367
29 dinosaurs 0.037367
41 fleeting-wonders 0.037367
101 sculptures 0.036208
35 extra-mile 0.035296
20 churches 0.033240
25 crime-and-punishment 0.033240
16 cemeteries 0.033240
14 cats 0.033240
49 halloween 0.033240
50 holidays 0.033240
44 games 0.031995
15 caves 0.031769
42 food 0.031127
27 death 0.028319
30 disaster-areas 0.027857
37 features 0.023676
23 computers 0.021552
34 exploration 0.017840
45 garbage 0.017840
18 china 0.016213
4 airplanes 0.013174
17 cheat-week 0.009833
46 garbage-week 0.006122

133 rows × 2 columns


In [197]:
metrics.roc_auc_score(y2,lr2.predict_proba(X2)[:,1])


Out[197]:
0.73045144294096509

In [198]:
lr2_probs = lr2_probs.set_index('tags')

In [199]:
lr2_probs = lr2_probs.join(total_tagged)

In [206]:
plt.figure(figsize=(10,10))
plt.scatter(lr2_probs.num_tagged,lr2_probs.probabilities)
plt.show()



In [201]:
lr2_probs = lr2_probs.sort_values('probabilities',ascending=False)

In [202]:
lr2_probs = lr2_probs.reset_index()

In [203]:
lr2_probs.to_csv('min15tags_min10000pvs.csv')

In [204]:
lr2_probs.shape


Out[204]:
(133, 3)

In [207]:
lr2_probs


Out[207]:
tags probabilities num_tagged
0 women 0.371218 16.0
1 aircraft 0.330070 16.0
2 visual 0.313056 117.0
3 sports 0.303625 45.0
4 mummies 0.288925 27.0
5 saints 0.272294 19.0
6 time-week 0.263755 27.0
7 ruins 0.241347 36.0
8 space 0.235258 105.0
9 science 0.235128 58.0
10 mountains 0.233971 21.0
11 ships 0.215629 20.0
12 tunnels 0.211667 22.0
13 trees 0.207395 29.0
14 statues 0.204833 18.0
15 trains 0.204833 17.0
16 magic 0.197443 40.0
17 society-adventures 0.191618 57.0
18 100-wonders 0.188841 44.0
19 libraries 0.182539 31.0
20 subterranean 0.182539 18.0
21 wwii 0.166283 33.0
22 world-s-fair 0.152512 20.0
23 skeletons 0.147225 20.0
24 insects 0.145729 15.0
25 map-monday 0.142249 35.0
26 nasa 0.141422 18.0
27 no-ones-watching-week 0.141422 17.0
28 new-york-city 0.141422 28.0
29 naturecultures 0.141422 27.0
... ... ... ...
103 ancient 0.043366 15.0
104 design 0.043092 15.0
105 architecture 0.042018 43.0
106 infrastructure 0.041597 20.0
107 animals 0.041193 165.0
108 escape-week 0.039818 26.0
109 books 0.037367 37.0
110 dinosaurs 0.037367 19.0
111 fleeting-wonders 0.037367 156.0
112 sculptures 0.036208 19.0
113 extra-mile 0.035296 16.0
114 churches 0.033240 32.0
115 crime-and-punishment 0.033240 52.0
116 cemeteries 0.033240 65.0
117 cats 0.033240 19.0
118 halloween 0.033240 16.0
119 holidays 0.033240 33.0
120 games 0.031995 20.0
121 caves 0.031769 23.0
122 food 0.031127 68.0
123 death 0.028319 43.0
124 disaster-areas 0.027857 25.0
125 features 0.023676 356.0
126 computers 0.021552 22.0
127 exploration 0.017840 30.0
128 garbage 0.017840 19.0
129 china 0.016213 19.0
130 airplanes 0.013174 22.0
131 cheat-week 0.009833 38.0
132 garbage-week 0.006122 31.0

133 rows × 3 columns


In [ ]: